The objective of this study is to identify neighborhoods in New York City that offer a high quality of life, taking into account rent prices, the variation in noise complaints, and crime rates across different areas. By analyzing these factors, we can assess the livability of different parts of the city. This study will benefit potential renters and policymakers who are looking to either move within NYC or allocate resources effectively across the city’s diverse neighborhoods.
The research will be conducted through the following questions:
How are rent prices expected to change in the next two years across New York City?
How do the noise complaint types vary across neighborhoods in New York City?
How do the crimes vary across neighborhoods in New York City?
Data Source: https://data.cityofnewyork.us/Social-Services/311-Noise-Complaints/p5f6-bkga/about_data
raw_noise = read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/NOISE/RAWNOISE_311_Noise_Complaints_20240222.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(skimr)
To get summarized and quantitative data from the raw noise data set, we will extract the key columns to analyze and aggregate, including:
Unique.Key: unique identifier of each complaint
Incident.Zip: zip code used to locate the complaint
Community.Board: community board, which maps to a precinct for locating the complaint
Borough: borough name
Created.Date: exact date of occurrence for the reported event
Complaint.Type: type of complaint (Residential, Commercial, etc.)
Descriptor: detailed description of the noise category
Latitude
Longitude
noise_subset = raw_noise %>%
select(Unique.Key, Created.Date, Community.Board, Borough, Incident.Zip, Incident.Address, Complaint.Type, Descriptor, Latitude, Longitude)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
noise_subset$Created.Date = mdy_hms(noise_subset$Created.Date)
filtered_noise_subset <- noise_subset %>%
filter(year(Created.Date) >= 2012 & year(Created.Date) <= 2022)
library(dplyr)
library(stringr)
imputed_noise <- filtered_noise_subset %>%
mutate(Borough = str_replace(Borough, "^$", "Unspecified"),
Community.Board = str_replace(Community.Board, "^$", "0 Unspecified"))
skim(imputed_noise)
| Name | imputed_noise |
| Number of rows | 5309712 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 4 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Community.Board | 0 | 1 | 8 | 25 | 0 | 77 | 0 |
| Borough | 0 | 1 | 5 | 13 | 0 | 6 | 0 |
| Incident.Address | 0 | 1 | 0 | 59 | 338563 | 533660 | 0 |
| Complaint.Type | 0 | 1 | 5 | 24 | 0 | 10 | 0 |
| Descriptor | 0 | 1 | 0 | 71 | 2 | 35 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Unique.Key | 0 | 1.00 | 41413536.15 | 9743397.46 | 22425840.00 | 33219755.50 | 42596970.50 | 50390328.25 | 56556254.00 | ▅▆▅▆▇ |
| Incident.Zip | 10712 | 1.00 | 10694.17 | 556.46 | 0.00 | 10036.00 | 10467.00 | 11225.00 | 12345.00 | ▁▁▁▁▇ |
| Latitude | 32594 | 0.99 | 40.76 | 0.08 | 40.50 | 40.69 | 40.75 | 40.83 | 40.91 | ▁▃▇▆▆ |
| Longitude | 32594 | 0.99 | -73.92 | 0.07 | -74.26 | -73.96 | -73.93 | -73.88 | -73.70 | ▁▁▇▆▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Created.Date | 0 | 1 | 2012-01-01 00:00:11 | 2022-12-31 23:59:13 | 2019-05-06 01:05:41 | 5141605 |
unique(imputed_noise$Borough)
## [1] "MANHATTAN" "BRONX" "QUEENS" "BROOKLYN"
## [5] "STATEN ISLAND" "Unspecified"
unique(imputed_noise$Community.Board)
## [1] "07 MANHATTAN" "11 BRONX"
## [3] "12 QUEENS" "02 QUEENS"
## [5] "11 QUEENS" "07 BROOKLYN"
## [7] "10 MANHATTAN" "04 BRONX"
## [9] "09 QUEENS" "03 QUEENS"
## [11] "02 BRONX" "03 BROOKLYN"
## [13] "05 BRONX" "04 BROOKLYN"
## [15] "09 BRONX" "06 BRONX"
## [17] "01 BROOKLYN" "04 QUEENS"
## [19] "06 MANHATTAN" "01 STATEN ISLAND"
## [21] "12 MANHATTAN" "02 BROOKLYN"
## [23] "08 BROOKLYN" "01 MANHATTAN"
## [25] "12 BRONX" "16 BROOKLYN"
## [27] "09 MANHATTAN" "12 BROOKLYN"
## [29] "09 BROOKLYN" "05 QUEENS"
## [31] "07 BRONX" "06 BROOKLYN"
## [33] "13 QUEENS" "08 BRONX"
## [35] "01 QUEENS" "05 MANHATTAN"
## [37] "05 BROOKLYN" "11 BROOKLYN"
## [39] "01 BRONX" "02 MANHATTAN"
## [41] "07 QUEENS" "11 MANHATTAN"
## [43] "02 STATEN ISLAND" "17 BROOKLYN"
## [45] "10 BROOKLYN" "08 MANHATTAN"
## [47] "04 MANHATTAN" "03 BRONX"
## [49] "18 BROOKLYN" "14 BROOKLYN"
## [51] "10 BRONX" "03 MANHATTAN"
## [53] "14 QUEENS" "08 QUEENS"
## [55] "03 STATEN ISLAND" "10 QUEENS"
## [57] "81 QUEENS" "15 BROOKLYN"
## [59] "13 BROOKLYN" "06 QUEENS"
## [61] "Unspecified MANHATTAN" "55 BROOKLYN"
## [63] "64 MANHATTAN" "Unspecified BRONX"
## [65] "0 Unspecified" "95 STATEN ISLAND"
## [67] "27 BRONX" "Unspecified BROOKLYN"
## [69] "Unspecified QUEENS" "Unspecified STATEN ISLAND"
## [71] "82 QUEENS" "84 QUEENS"
## [73] "28 BRONX" "80 QUEENS"
## [75] "26 BRONX" "83 QUEENS"
## [77] "56 BROOKLYN"
Due to limited time and computing power, instead of cleaning up the unspecified areas and zip codes, we will drop the records with missing values, approximately 0.5% of the raw data, and proceed.
cleaned_noise = imputed_noise %>%
filter(!(str_detect(Community.Board, "^0 Unspecified"))) %>%
filter(!(str_detect(Community.Board, "^Unspecified"))) %>%
filter(!(str_detect(Descriptor, "^\\s*$")))
cleaned_noise = cleaned_noise[!is.na(cleaned_noise$Incident.Zip), ]
nrow(cleaned_noise)
## [1] 5279309
After this reduction, we keep 99.43% of the raw Noise data for the quantitative summary.
sprintf("%.10f", 5279309/5309712) # we kept 99.43% of the raw data
## [1] "0.9942740774"
cleaned_noise$Incident.Zip <- as.character(cleaned_noise$Incident.Zip)
cleaned_noise_final <- cleaned_noise %>%
mutate(Borough = ifelse(Borough == 'MANHATTAN' & Community.Board == '08 BRONX', 'BRONX', Borough)) %>%
mutate(Borough = ifelse(Borough == 'BRONX' & Community.Board == '01 QUEENS', 'QUEENS', Borough))
cleaned_noise_final %>%
filter(Borough == 'MANHATTAN')%>%
group_by(Community.Board) %>%
summarise(complaints = n())
## # A tibble: 13 × 2
## Community.Board complaints
## <chr> <int>
## 1 01 MANHATTAN 56867
## 2 02 MANHATTAN 96231
## 3 03 MANHATTAN 171221
## 4 04 MANHATTAN 109874
## 5 05 MANHATTAN 72173
## 6 06 MANHATTAN 79380
## 7 07 MANHATTAN 136185
## 8 08 MANHATTAN 86978
## 9 09 MANHATTAN 132066
## 10 10 MANHATTAN 167450
## 11 11 MANHATTAN 100502
## 12 12 MANHATTAN 316333
## 13 64 MANHATTAN 3118
noise_summary<- cleaned_noise_final%>%
mutate(year = year(Created.Date),
month = month(Created.Date)) %>%
group_by(Borough, Community.Board, Incident.Zip, Complaint.Type, Descriptor, year, month) %>%
summarise(complaints = n())
## `summarise()` has grouped output by 'Borough', 'Community.Board',
## 'Incident.Zip', 'Complaint.Type', 'Descriptor', 'year'. You can override using
## the `.groups` argument.
noise_summary <- noise_summary %>%
arrange(Incident.Zip, year, month, Complaint.Type, Descriptor)
write.csv(noise_summary, 'noise_summary.csv',row.names = F)
write.csv(cleaned_noise_final, 'cleaned_noise_final.csv',row.names = F)
Data Source: https://www.nyc.gov/site/nypd/stats/crime-statistics/borough-and-precinct-crime-stats.page#manhattan
library(dplyr)
library(skimr)
raw_crime = read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/RECLEAN/RAW_NYPD_Complaint_Data 2012-2022.csv")
skim(raw_crime)
| Name | raw_crime |
| Number of rows | 11405 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 26 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| CMPLNT_NUM | 0 | 1 | 9 | 14 | 0 | 11405 | 0 |
| BORO_NM | 0 | 1 | 5 | 13 | 0 | 6 | 0 |
| CMPLNT_FR_DT | 0 | 1 | 10 | 10 | 0 | 1478 | 0 |
| CMPLNT_FR_TM | 0 | 1 | 8 | 8 | 0 | 1007 | 0 |
| CMPLNT_TO_DT | 0 | 1 | 0 | 10 | 1175 | 1356 | 0 |
| CMPLNT_TO_TM | 0 | 1 | 6 | 8 | 0 | 990 | 0 |
| CRM_ATPT_CPTD_CD | 0 | 1 | 9 | 9 | 0 | 2 | 0 |
| HADEVELOPT | 0 | 1 | 4 | 15 | 0 | 14 | 0 |
| JURIS_DESC | 0 | 1 | 5 | 24 | 0 | 16 | 0 |
| LAW_CAT_CD | 0 | 1 | 6 | 11 | 0 | 3 | 0 |
| LOC_OF_OCCUR_DESC | 0 | 1 | 6 | 11 | 0 | 6 | 0 |
| OFNS_DESC | 0 | 1 | 4 | 36 | 0 | 45 | 0 |
| PARKS_NM | 0 | 1 | 6 | 30 | 0 | 30 | 0 |
| PATROL_BORO | 0 | 1 | 17 | 25 | 0 | 8 | 0 |
| PD_DESC | 0 | 1 | 6 | 71 | 0 | 218 | 0 |
| PREM_TYP_DESC | 0 | 1 | 3 | 28 | 0 | 76 | 0 |
| RPT_DT | 0 | 1 | 10 | 10 | 0 | 363 | 0 |
| STATION_NAME | 0 | 1 | 6 | 30 | 0 | 66 | 0 |
| SUSP_AGE_GROUP | 0 | 1 | 3 | 7 | 0 | 9 | 0 |
| SUSP_RACE | 0 | 1 | 5 | 30 | 0 | 8 | 0 |
| SUSP_SEX | 0 | 1 | 1 | 6 | 0 | 4 | 0 |
| VIC_AGE_GROUP | 0 | 1 | 3 | 7 | 0 | 8 | 0 |
| VIC_RACE | 0 | 1 | 5 | 30 | 0 | 8 | 0 |
| VIC_SEX | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| Lat_Lon | 0 | 1 | 0 | 40 | 1 | 7729 | 0 |
| New.Georeferenced.Column | 0 | 1 | 0 | 45 | 1 | 7729 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ADDR_PCT_CD | 4 | 1.00 | 65.93 | 35.01 | 1.00 | 41.00 | 67.00 | 103.00 | 123.00 | ▅▆▆▃▇ |
| HOUSING_PSA | 10760 | 0.06 | 7529.66 | 14632.45 | 218.00 | 463.00 | 696.00 | 1291.00 | 47683.00 | ▇▁▁▁▁ |
| JURISDICTION_CODE | 0 | 1.00 | 1.49 | 10.08 | 0.00 | 0.00 | 0.00 | 0.00 | 97.00 | ▇▁▁▁▁ |
| KY_CD | 0 | 1.00 | 247.88 | 149.82 | 101.00 | 109.00 | 233.00 | 341.00 | 678.00 | ▇▁▆▁▂ |
| PD_CD | 18 | 1.00 | 424.82 | 193.12 | 101.00 | 321.00 | 407.00 | 638.00 | 922.00 | ▅▇▃▆▁ |
| TRANSIT_DISTRICT | 11223 | 0.02 | 12.23 | 11.10 | 1.00 | 3.00 | 11.00 | 20.00 | 34.00 | ▇▆▁▁▃ |
| X_COORD_CD | 1 | 1.00 | 1005269.21 | 22415.07 | 913511.00 | 991626.50 | 1004593.00 | 1017933.00 | 1066893.00 | ▁▁▇▆▂ |
| Y_COORD_CD | 1 | 1.00 | 206258.29 | 30351.98 | 123626.00 | 183890.00 | 205330.00 | 232216.75 | 271303.00 | ▁▅▇▆▃ |
| Latitude | 1 | 1.00 | 40.73 | 0.08 | 40.51 | 40.67 | 40.73 | 40.80 | 40.91 | ▁▅▇▆▃ |
| Longitude | 1 | 1.00 | -73.92 | 0.08 | -74.25 | -73.97 | -73.93 | -73.88 | -73.70 | ▁▁▇▆▂ |
To get summarized and quantitative data from the raw crime dataset, we will extract the key columns to analyze and aggregate, including:
CMPLNT_NUM: unique identifier of each case
ADDR_PCT_CD: the precinct in which the incident occurred
BORO_NM: borough name
CMPLNT_FR_DT: exact date of occurrence for the reported event
LAW_CAT_CD: level of offense (felony, misdemeanor, violation)
OFNS_DESC: description of offense corresponding with the key code
Latitude
Longitude
With the above columns, we will be able to summarize the total number of crimes each month by location and crime type:
crime_subset = raw_crime %>%
select(CMPLNT_NUM, CMPLNT_FR_DT, BORO_NM, ADDR_PCT_CD, LAW_CAT_CD, OFNS_DESC, Latitude, Longitude)
crime_subset %>%
filter(is.na(ADDR_PCT_CD))
## CMPLNT_NUM CMPLNT_FR_DT BORO_NM ADDR_PCT_CD LAW_CAT_CD
## 1 271421229H1 07/05/2020 BROOKLYN NA FELONY
## 2 270324746H1 08/06/2022 BRONX NA FELONY
## 3 270325369H1 07/22/2022 BRONX NA FELONY
## 4 268272009H1 01/02/2022 QUEENS NA FELONY
## OFNS_DESC Latitude Longitude
## 1 MURDER & NON-NEGL. MANSLAUGHTER NA NA
## 2 MURDER & NON-NEGL. MANSLAUGHTER 40.87584 -73.90313
## 3 MURDER & NON-NEGL. MANSLAUGHTER 40.87366 -73.89837
## 4 MURDER & NON-NEGL. MANSLAUGHTER 40.68067 -73.84272
Based on the longitude and latitude data, we can identify the precinct info and will impute it as below:
270325369H1: precinct 50 (Bronx)
270324746H1: precinct 50 (Bronx)
268272009H1: precinct 106 (Queens)
271421229H1: precinct 67 (Brooklyn)
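As a sanity check, the precinct for the three records that do have coordinates can also be recovered programmatically. The helper below is a hypothetical sketch (not part of the original workflow) that assigns a point the precinct of its nearest geocoded complaint.
# Hypothetical helper: precinct of the nearest complaint with known coordinates
nearest_precinct <- function(lat, lon, reference) {
  d <- sqrt((reference$Latitude - lat)^2 + (reference$Longitude - lon)^2)
  reference$ADDR_PCT_CD[which.min(d)]
}
known <- crime_subset %>% filter(!is.na(ADDR_PCT_CD), !is.na(Latitude))
nearest_precinct(40.87366, -73.89837, known) # should agree with the manual lookup (precinct 50)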
crime_subset <- crime_subset %>%
mutate(ADDR_PCT_CD = case_when(
CMPLNT_NUM == "270325369H1" ~ 50,
CMPLNT_NUM == "270324746H1" ~ 50,
CMPLNT_NUM == "268272009H1" ~ 106,
CMPLNT_NUM == "271421229H1" ~ 67,
TRUE ~ ADDR_PCT_CD
))
#cross checking imputed data
crime_subset %>%
filter(CMPLNT_NUM %in% c("270325369H1", "270324746H1", "268272009H1", "271421229H1"))
## CMPLNT_NUM CMPLNT_FR_DT BORO_NM ADDR_PCT_CD LAW_CAT_CD
## 1 271421229H1 07/05/2020 BROOKLYN 67 FELONY
## 2 270324746H1 08/06/2022 BRONX 50 FELONY
## 3 270325369H1 07/22/2022 BRONX 50 FELONY
## 4 268272009H1 01/02/2022 QUEENS 106 FELONY
## OFNS_DESC Latitude Longitude
## 1 MURDER & NON-NEGL. MANSLAUGHTER NA NA
## 2 MURDER & NON-NEGL. MANSLAUGHTER 40.87584 -73.90313
## 3 MURDER & NON-NEGL. MANSLAUGHTER 40.87366 -73.89837
## 4 MURDER & NON-NEGL. MANSLAUGHTER 40.68067 -73.84272
Now, assign the borough based on the precinct number:
crime_subset <- crime_subset %>%
mutate(BORO_NM = case_when(
ADDR_PCT_CD >= 1 & ADDR_PCT_CD <= 35 ~ "MANHATTAN",
ADDR_PCT_CD >= 40 & ADDR_PCT_CD <= 52 ~ "BRONX",
ADDR_PCT_CD >= 60 & ADDR_PCT_CD <= 94 ~ "BROOKLYN",
ADDR_PCT_CD >= 100 & ADDR_PCT_CD <= 115 ~ "QUEENS",
ADDR_PCT_CD >= 120 & ADDR_PCT_CD <= 125 ~ "STATEN ISLAND",
TRUE ~ BORO_NM
))
unique(crime_subset$BORO_NM)
## [1] "BROOKLYN" "STATEN ISLAND" "MANHATTAN" "QUEENS"
## [5] "BRONX"
Based on skimming the data, all four "(null)" values in OFNS_DESC correspond to obscenity offenses according to the PD_DESC column, so we recode them accordingly:
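A quick way to verify this (a sketch, assuming PD_DESC is populated for these rows) is to look up the corresponding PD_DESC values in the raw data:
# Cross-check the "(null)" offense descriptions against the more granular PD_DESC column
raw_crime %>%
  filter(OFNS_DESC == "(null)") %>%
  select(CMPLNT_NUM, PD_DESC, LAW_CAT_CD)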
crime_subset <- crime_subset %>%
mutate(OFNS_DESC = ifelse(OFNS_DESC == "(null)", "OBSCENIT", OFNS_DESC))
unique(crime_subset$OFNS_DESC)
## [1] "MURDER & NON-NEGL. MANSLAUGHTER"
## [2] "THEFT-FRAUD"
## [3] "SEX CRIMES"
## [4] "PETIT LARCENY"
## [5] "GRAND LARCENY"
## [6] "HARRASSMENT 2"
## [7] "OFF. AGNST PUB ORD SENSBLTY &"
## [8] "FRAUDS"
## [9] "MISCELLANEOUS PENAL LAW"
## [10] "CRIMINAL MISCHIEF & RELATED OF"
## [11] "GRAND LARCENY OF MOTOR VEHICLE"
## [12] "ASSAULT 3 & RELATED OFFENSES"
## [13] "VEHICLE AND TRAFFIC LAWS"
## [14] "OTHER OFFENSES RELATED TO THEF"
## [15] "OTHER STATE LAWS (NON PENAL LA"
## [16] "RAPE"
## [17] "OBSCENIT"
## [18] "BURGLARY"
## [19] "UNAUTHORIZED USE OF A VEHICLE"
## [20] "ROBBERY"
## [21] "FELONY ASSAULT"
## [22] "DANGEROUS DRUGS"
## [23] "POSSESSION OF STOLEN PROPERTY"
## [24] "NYS LAWS-UNCLASSIFIED FELONY"
## [25] "FORGERY"
## [26] "CRIMINAL TRESPASS"
## [27] "OFFENSES AGAINST PUBLIC ADMINI"
## [28] "ARSON"
## [29] "OFFENSES INVOLVING FRAUD"
## [30] "ANTICIPATORY OFFENSES"
## [31] "OFFENSES AGAINST THE PERSON"
## [32] "DANGEROUS WEAPONS"
## [33] "AGRICULTURE & MRKTS LAW-UNCLASSIFIED"
## [34] "PROSTITUTION & RELATED OFFENSES"
## [35] "ADMINISTRATIVE CODE"
## [36] "THEFT OF SERVICES"
## [37] "JOSTLING"
## [38] "INTOXICATED & IMPAIRED DRIVING"
## [39] "KIDNAPPING & RELATED OFFENSES"
## [40] "OFFENSES AGAINST PUBLIC SAFETY"
## [41] "HOMICIDE-NEGLIGENT,UNCLASSIFIE"
## [42] "OFFENSES RELATED TO CHILDREN"
## [43] "FRAUDULENT ACCOSTING"
## [44] "OTHER STATE LAWS"
## [45] "PETIT LARCENY OF MOTOR VEHICLE"
library(lubridate)
crime_subset$CMPLNT_FR_DT = mdy(crime_subset$CMPLNT_FR_DT)
Based on the cleaned file, only one record is missing geographical latitude and longitude; we will impute it with mice.
skim(crime_subset)
| Name | crime_subset |
| Number of rows | 11405 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| CMPLNT_NUM | 0 | 1 | 9 | 14 | 0 | 11405 | 0 |
| BORO_NM | 0 | 1 | 5 | 13 | 0 | 5 | 0 |
| LAW_CAT_CD | 0 | 1 | 6 | 11 | 0 | 3 | 0 |
| OFNS_DESC | 0 | 1 | 4 | 36 | 0 | 45 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| CMPLNT_FR_DT | 0 | 1 | 2012-01-01 | 2022-12-31 | 2022-10-11 | 1478 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ADDR_PCT_CD | 0 | 1 | 65.93 | 35.00 | 1.00 | 41.00 | 67.00 | 103.00 | 123.00 | ▅▆▆▃▇ |
| Latitude | 1 | 1 | 40.73 | 0.08 | 40.51 | 40.67 | 40.73 | 40.80 | 40.91 | ▁▅▇▆▃ |
| Longitude | 1 | 1 | -73.92 | 0.08 | -74.25 | -73.97 | -73.93 | -73.88 | -73.70 | ▁▁▇▆▂ |
Since the crime dataset will be used for spatial analysis, we need to impute this one missing longitude and latitude value.
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
mice_crime = mice::complete(mice(crime_subset,seed = 617))
##
## iter imp variable
## 1 1 Latitude Longitude
## 1 2 Latitude Longitude
## 1 3 Latitude Longitude
## 1 4 Latitude Longitude
## 1 5 Latitude Longitude
## 2 1 Latitude Longitude
## 2 2 Latitude Longitude
## 2 3 Latitude Longitude
## 2 4 Latitude Longitude
## 2 5 Latitude Longitude
## 3 1 Latitude Longitude
## 3 2 Latitude Longitude
## 3 3 Latitude Longitude
## 3 4 Latitude Longitude
## 3 5 Latitude Longitude
## 4 1 Latitude Longitude
## 4 2 Latitude Longitude
## 4 3 Latitude Longitude
## 4 4 Latitude Longitude
## 4 5 Latitude Longitude
## 5 1 Latitude Longitude
## 5 2 Latitude Longitude
## 5 3 Latitude Longitude
## 5 4 Latitude Longitude
## 5 5 Latitude Longitude
## Warning: Number of logged events: 4
write.csv(mice_crime, 'crime_cleaned_for_Javier.csv',row.names = F)
#by borough - precinct- law- offense type- year- months
crime_summary<- mice_crime %>%
mutate(year = year(CMPLNT_FR_DT),
month = month(CMPLNT_FR_DT)) %>%
group_by(BORO_NM, ADDR_PCT_CD , LAW_CAT_CD, OFNS_DESC, year, month) %>%
summarise(complaints = n())
## `summarise()` has grouped output by 'BORO_NM', 'ADDR_PCT_CD', 'LAW_CAT_CD',
## 'OFNS_DESC', 'year'. You can override using the `.groups` argument.
crime_summary <- crime_summary %>%
arrange(ADDR_PCT_CD, year, month, LAW_CAT_CD, OFNS_DESC)
Data Source: https://streeteasy.com/blog/data-dashboard
library(dplyr)
library(tidyr)
rental_data <- read.csv("medianAskingRent_All.csv")
#Clean Data
cleaned_rentaldata <- rental_data |>
pivot_longer(cols = 4:172, names_to = "Date", values_to = "median_rental_price") |> #Pivot Longer
separate_wider_delim(Date, delim = ".",names = c("Year", "Month")) #Split Date
cleaned_rentaldata$Year <- gsub("X","",cleaned_rentaldata$Year) #Remove X value from Year
#Filter Data
cleaned_rentaldata <- cleaned_rentaldata |>
filter(Year %in%
c("2010", "2011","2012","2013","2014","2015","2016","2017",
"2018","2019","2020","2021","2022","2023"))
nrow(cleaned_rentaldata)
## [1] 33264
city_borough = cleaned_rentaldata %>%
filter(areaName %in%
c("NYC", "Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"))
city_borough %>%
filter(is.na(median_rental_price))
## # A tibble: 29 × 6
## areaName Borough areaType Year Month median_rental_price
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Staten Island Staten Island borough 2010 01 NA
## 2 Staten Island Staten Island borough 2010 02 NA
## 3 Staten Island Staten Island borough 2010 03 NA
## 4 Staten Island Staten Island borough 2010 04 NA
## 5 Staten Island Staten Island borough 2010 05 NA
## 6 Staten Island Staten Island borough 2010 06 NA
## 7 Staten Island Staten Island borough 2010 07 NA
## 8 Staten Island Staten Island borough 2010 08 NA
## 9 Staten Island Staten Island borough 2010 09 NA
## 10 Staten Island Staten Island borough 2010 10 NA
## # ℹ 19 more rows
city_borough <- city_borough %>%
mutate(Borough = case_when(
areaName == "NYC" ~ "NYC",
TRUE ~ Borough
))
Staten_rent = city_borough %>%
filter(areaName == "Staten Island")
Staten_rent$Year <- as.integer(Staten_rent$Year)
Staten_rent$Month <- as.integer(Staten_rent$Month)
skim(Staten_rent)
| Name | Staten_rent |
| Number of rows | 168 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| areaName | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
| Borough | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
| areaType | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1.00 | 2016.50 | 4.04 | 2010 | 2013.00 | 2016.5 | 2020.00 | 2023 | ▇▇▅▇▇ |
| Month | 0 | 1.00 | 6.50 | 3.46 | 1 | 3.75 | 6.5 | 9.25 | 12 | ▇▅▅▅▇ |
| median_rental_price | 29 | 0.83 | 1879.73 | 232.96 | 1175 | 1800.00 | 1932.0 | 2000.00 | 2325 | ▁▁▃▇▂ |
library(mice)
library(randomForest)
library(ranger)
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
mice_staten_rent = mice::complete(mice(Staten_rent,method = "rf", seed = 617, rfPackage='randomForest'))
##
## iter imp variable
## 1 1 median_rental_price
## 1 2 median_rental_price
## 1 3 median_rental_price
## 1 4 median_rental_price
## 1 5 median_rental_price
## 2 1 median_rental_price
## 2 2 median_rental_price
## 2 3 median_rental_price
## 2 4 median_rental_price
## 2 5 median_rental_price
## 3 1 median_rental_price
## 3 2 median_rental_price
## 3 3 median_rental_price
## 3 4 median_rental_price
## 3 5 median_rental_price
## 4 1 median_rental_price
## 4 2 median_rental_price
## 4 3 median_rental_price
## 4 4 median_rental_price
## 4 5 median_rental_price
## 5 1 median_rental_price
## 5 2 median_rental_price
## 5 3 median_rental_price
## 5 4 median_rental_price
## 5 5 median_rental_price
## Warning: Number of logged events: 3
city_borough$Year <- as.integer(city_borough$Year)
city_borough$Month <- as.integer(city_borough$Month)
city_borough = city_borough %>%
filter(!(areaName == "Staten Island"))
city_borough <- rbind(city_borough, mice_staten_rent)
skim(city_borough)
| Name | city_borough |
| Number of rows | 1008 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| areaName | 0 | 1 | 3 | 13 | 0 | 6 | 0 |
| Borough | 0 | 1 | 3 | 13 | 0 | 6 | 0 |
| areaType | 0 | 1 | 4 | 7 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2016.50 | 4.03 | 2010 | 2013.00 | 2016.5 | 2020.00 | 2023 | ▇▇▅▇▇ |
| Month | 0 | 1 | 6.50 | 3.45 | 1 | 3.75 | 6.5 | 9.25 | 12 | ▇▅▅▅▇ |
| median_rental_price | 0 | 1 | 2449.88 | 640.01 | 1175 | 1966.00 | 2450.0 | 2900.00 | 4395 | ▅▇▇▃▁ |
The CSV for the time-series analysis is now cleaned:
write.csv(city_borough, 'city_borough_rent.csv',row.names = F)
How are rent prices expected to change in the next two years across New York City? To answer this, we will utilize a time-series analysis of the median rent prices for NYC from 2010 to 2023. This will help discern patterns and trends that may affect future prices and explore how they vary across the city.
library(ggplot2);library(ggthemes);library(gridExtra) # For plots
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
library(quantmod);library(xts);library(zoo) # For using xts class objects
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(forecast) # Set of forecasting functions
library(fpp); library(fpp2) # Datasets from Forecasting text by Rob Hyndman
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: tseries
##
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
##
## ausair, ausbeer, austa, austourists, debitcards, departures,
## elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(tseries) # for a statistical test
library(dplyr) # Data wrangling
time_retal = city_borough
time_retal$Date = as.Date(paste(time_retal$Year, time_retal$Month, "01", sep="-"))
time_retal$`median_rental_price` <- as.numeric(time_retal$`median_rental_price`)
Plot for NYC
# NYC median rent over time
time_retal %>%
filter(areaType == "city")%>%
ggplot(aes(x=Date, y=median_rental_price)) +
geom_line() +
facet_grid(areaName~.) +
labs(title = "Median Rent Over Time NYC",
x = "Date",
y = "Price")
Plot for 5 boroughs
# 5 borough median rent over time
time_retal %>%
filter(areaType == "borough")%>%
ggplot(aes(x=Date, y=median_rental_price)) +
geom_line() +
facet_grid(areaName~.) +
labs(title = "Median Rent Over Time by Borough",
x = "Date",
y = "Price")
Initial Observations: A descriptive look at the median rental prices reveals a structural break in the data post-2020, attributable to the macroeconomic impact of Covid-19, which caused a drop in rental prices. However, this drop was offset by an upward spike over the past two years. The seasonal plot shows how rental prices fluctuate across the year, with minor peaks in the summer months and troughs in the winter months.
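The post-2020 break is visible in the plot; if a formal check is desired, a supF test from the strucchange package (an additional package, not used elsewhere in this report) is one way to sketch it:
library(strucchange)
nyc_level <- time_retal %>% filter(areaType == "city") %>% arrange(Date)
nyc_level_ts <- ts(nyc_level$median_rental_price, start = c(2010, 1), frequency = 12)
sctest(Fstats(nyc_level_ts ~ 1)) # supF test for a single break in the mean level
breakpoints(nyc_level_ts ~ 1) # estimated break date(s)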
NYCtime_retal = city_borough %>%
filter(areaType == "city")
NYCtime_retal$Date <- as.Date(paste(NYCtime_retal$Year, NYCtime_retal$Month, "01", sep="-"))
# Convert the Median Rental Price to numeric
NYCtime_retal$`median_rental_price` <- as.numeric(NYCtime_retal$`median_rental_price`)
# Convert to a TS object
NYC_ts <- ts(NYCtime_retal$`median_rental_price`, start=c(2010, 1), frequency=12)
NYC_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2010 2650 2600 2600 2650 2695 2700 2750 2750 2800 2825 2800 2750
## 2011 2750 2725 2748 2795 2800 2895 2896 2950 3000 3000 3000 2995
## 2012 2950 2925 2950 2995 2995 2995 2995 3000 3050 3000 3000 3000
## 2013 2995 3000 3000 2995 2995 2995 2995 3000 3000 2995 2950 2930
## 2014 2900 2900 2900 2950 2950 2903 2887 2887 2850 2825 2800 2800
## 2015 2800 2800 2800 2850 2850 2899 2900 2950 2900 2895 2825 2850
## 2016 2860 2895 2900 2925 2950 2975 2953 2950 2900 2867 2800 2800
## 2017 2750 2750 2700 2750 2770 2795 2800 2856 2800 2795 2700 2700
## 2018 2695 2672 2699 2750 2795 2800 2825 2800 2784 2700 2658 2654
## 2019 2695 2700 2700 2800 2850 2900 2950 2949 2926 2900 2899 2895
## 2020 2900 2900 2925 2995 2950 2890 2800 2745 2650 2550 2500 2500
## 2021 2499 2500 2495 2500 2520 2600 2695 2700 2750 2700 2750 2800
## 2022 2900 2995 3000 3200 3350 3500 3550 3550 3500 3500 3400 3405
## 2023 3495 3500 3500 3675 3728 3750 3795 3750 3700 3600 3500 3500
Visualize NYC_ts
ggseasonplot(NYC_ts) +
labs(title = "NYC Median Rent by Year",
x = "Date",
y = "Price")
ggseasonplot(NYC_ts, polar=T) +
labs(title = "NYC Median Rent by Year (Polar)",
x = "Date",
y = "Price")
NYC_ts%>%
stl(s.window = 'periodic')%>%
autoplot()
library(forecast)
pacf(x = NYC_ts)
Train and Test
train_NYC = window(NYC_ts,start=c(2010,01),end=c(2022,12)) # 13 years (2010-2022) for training
test_NYC = window(NYC_ts,start=c(2023,01),end=c(2023,12)) # 1 year (2023) for testing
seasonal_naive_model = snaive(train_NYC,h=12)
seasonal_naive_model$mean
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2023 2900 2995 3000 3200 3350 3500 3550 3550 3500 3500 3400 3405
accuracy(seasonal_naive_model,x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 50.55556 255.8956 168.0278 1.293095 5.700921 1.000000 0.9598879
## Test set 303.58333 348.7010 303.5833 8.417811 8.417811 1.806745 0.7521645
## Theil's U
## Training set NA
## Test set 4.295927
drift_model = rwf(train_NYC,h=12,drift = T)
drift_model$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3409.871 3414.742 3419.613 3424.484 3429.355 3434.226 3439.097 3443.968
## Sep Oct Nov Dec
## 2023 3448.839 3453.710 3458.581 3463.452
accuracy(drift_model,x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.195309e-13 45.99884 32.08075 -0.02137561 1.115442 0.1909253
## Test set 1.877554e+02 219.94235 187.75538 5.08626853 5.086269 1.1174068
## ACF1 Theil's U
## Training set 0.4292964 NA
## Test set 0.7012110 2.980163
ses_model = ses(train_NYC,h = 12)
ses_model$mean
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2023 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405
accuracy(ses_model,x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.5182 46.28463 31.39466 0.1360588 1.091805 0.1868421
## Test set 219.4172 247.76192 219.41717 5.9591531 5.959153 1.3058386
## ACF1 Theil's U
## Training set 0.4346218 NA
## Test set 0.7082495 3.360923
holt_model = holt(train_NYC,h=12)
holt_model$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3370.676 3342.576 3314.476 3286.376 3258.275 3230.175 3202.075 3173.975
## Sep Oct Nov Dec
## 2023 3145.874 3117.774 3089.674 3061.574
accuracy(holt_model,x=NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02514677 43.10313 32.73571 0.01975212 1.140001 0.1948232
## Test set 408.29168122 437.80641 408.29168 11.16654613 11.166546 2.4299059
## ACF1 Theil's U
## Training set -0.02300429 NA
## Test set 0.74804521 5.981476
holt_damped_model = holt(train_NYC, h=12, damped = T)
holt_damped_model$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3377.408 3362.281 3350.160 3340.449 3332.668 3326.434 3321.439 3317.437
## Sep Oct Nov Dec
## 2023 3314.230 3311.661 3309.603 3307.954
accuracy(holt_damped_model,x=NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.9865076 40.41997 30.97691 0.04057374 1.077790 0.1843559
## Test set 293.4397406 318.76429 293.43974 7.99663627 7.996636 1.7463764
## ACF1 Theil's U
## Training set -0.02522248 NA
## Test set 0.71559956 4.33274
hw_additive = hw(train_NYC,h=12,seasonal = 'additive', damped=T)
hw_additive$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3382.099 3385.140 3391.893 3447.936 3464.794 3493.813 3490.153 3496.547
## Sep Oct Nov Dec
## 2023 3478.582 3446.330 3409.390 3412.947
accuracy(hw_additive,x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 2.49225 38.5647 29.13235 0.08017633 1.013622 0.1733782
## Test set 182.78142 197.8411 182.78142 4.98266301 4.982663 1.0878048
## ACF1 Theil's U
## Training set 0.1784840 NA
## Test set 0.6785816 2.672182
hw_multiplicative = hw(train_NYC,h=12,seasonal = 'multiplicative', damped=T)
hw_multiplicative$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3410.151 3415.702 3434.519 3526.436 3576.448 3634.081 3665.192 3684.871
## Sep Oct Nov Dec
## 2023 3672.138 3636.529 3582.870 3570.109
accuracy(hw_multiplicative,x=NYC_ts)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.57594 37.21519 28.97611 0.0577228 1.008207 0.1724483 0.2358949
## Test set 56.99625 96.63299 88.58094 1.5359082 2.433496 0.5271803 0.7421122
## Theil's U
## Training set NA
## Test set 1.301538
ets_aaa = ets(train_NYC,model = 'AAA')
summary(ets_aaa)
## ETS(A,Ad,A)
##
## Call:
## ets(y = train_NYC, model = "AAA")
##
## Smoothing parameters:
## alpha = 0.9619
## beta = 0.1389
## gamma = 0.0279
## phi = 0.8238
##
## Initial states:
## l = 2739.3009
## b = 16.7025
## s = -28.6161 -33.2154 2.0309 34.9963 57.3081 42.5817
## 53.8002 22.091 6.3266 -44.6203 -48.5522 -64.1309
##
## sigma: 40.855
##
## AIC AICc BIC
## 1963.307 1968.300 2018.204
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.49225 38.5647 29.13235 0.08017633 1.013622 0.1733782 0.178484
checkresiduals(ets_aaa)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,A)
## Q* = 59.061, df = 24, p-value = 8.657e-05
##
## Model df: 0. Total lags used: 24
The residuals from the ETS(A,Ad,A) model are not white noise: the Ljung-Box test over 24 lags rejects the null hypothesis of no autocorrelation (p < 0.001). Because the residuals show significant autocorrelation, it is advisable to revisit the model specification.
ets_auto = ets(train_NYC)
summary(ets_auto)
## ETS(M,Ad,N)
##
## Call:
## ets(y = train_NYC)
##
## Smoothing parameters:
## alpha = 0.8185
## beta = 0.6798
## phi = 0.8
##
## Initial states:
## l = 2581.9604
## b = 23.9438
##
## sigma: 0.0144
##
## AIC AICc BIC
## 1954.362 1954.926 1972.662
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.123409 40.80343 31.29631 0.04537647 1.089583 0.1862568
## ACF1
## Training set -0.0151643
ets_auto_forecast = forecast(ets_auto,h=12)
accuracy(ets_auto_forecast,x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.123409 40.80343 31.29631 0.04537647 1.089583 0.1862568
## Test set 296.042525 321.34428 296.04253 8.06821713 8.068217 1.7618666
## ACF1 Theil's U
## Training set -0.0151643 NA
## Test set 0.7157938 4.368213
model_auto = auto.arima(y = train_NYC,d = 1,D = 1,stepwise = F,approximation = F)
model_auto
## Series: train_NYC
## ARIMA(2,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 sma1
## 0.2190 0.4040 -0.7446
## s.e. 0.0767 0.0779 0.0760
##
## sigma^2 = 1392: log likelihood = -723.99
## AIC=1455.97 AICc=1456.26 BIC=1467.82
checkresiduals(model_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)(0,1,1)[12]
## Q* = 21.743, df = 21, p-value = 0.4145
##
## Model df: 3. Total lags used: 24
#Forecast
arima_auto_forecast <- forecast(model_auto, h=12)
accuracy(arima_auto_forecast, x = NYC_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.6321782 35.34529 26.67719 0.02739003 0.9221564 0.1587665
## Test set 128.9062272 144.36246 128.90623 3.51018986 3.5101899 0.7671721
## ACF1 Theil's U
## Training set 0.02671677 NA
## Test set 0.66884334 1.946462
accuracy_table = rbind(seasonal_naive_model = accuracy(f = seasonal_naive_model,x = NYC_ts)[2,],
drift_model = accuracy(f = drift_model,x = NYC_ts)[2,],
ses_model = accuracy(f = ses_model,x = NYC_ts)[2,],
holt_model = accuracy(f = holt_model,x = NYC_ts)[2,],
holt_damped_model = accuracy(f = holt_damped_model,x = NYC_ts)[2,],
hw_additive_model = accuracy(f = hw_additive,x = NYC_ts)[2,],
hw_multiplicative = accuracy(f = hw_multiplicative,x = NYC_ts)[2,],
ets_auto = accuracy(ets_auto_forecast,x = NYC_ts)[2,],
arima = accuracy(arima_auto_forecast,x=NYC_ts)[2,]
)
## Warning in accuracy.default(f = seasonal_naive_model, x = NYC_ts): Using `f` as
## the argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = drift_model, x = NYC_ts): Using `f` as the
## argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = ses_model, x = NYC_ts): Using `f` as the
## argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = holt_model, x = NYC_ts): Using `f` as the
## argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = holt_damped_model, x = NYC_ts): Using `f` as
## the argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = hw_additive, x = NYC_ts): Using `f` as the
## argument for `accuracy()` is deprecated. Please use `object` instead.
## Warning in accuracy.default(f = hw_multiplicative, x = NYC_ts): Using `f` as
## the argument for `accuracy()` is deprecated. Please use `object` instead.
accuracy_table
## ME RMSE MAE MPE MAPE
## seasonal_naive_model 303.58333 348.70104 303.58333 8.417811 8.417811
## drift_model 187.75538 219.94235 187.75538 5.086269 5.086269
## ses_model 219.41717 247.76192 219.41717 5.959153 5.959153
## holt_model 408.29168 437.80641 408.29168 11.166546 11.166546
## holt_damped_model 293.43974 318.76429 293.43974 7.996636 7.996636
## hw_additive_model 182.78142 197.84110 182.78142 4.982663 4.982663
## hw_multiplicative 56.99625 96.63299 88.58094 1.535908 2.433496
## ets_auto 296.04253 321.34428 296.04253 8.068217 8.068217
## arima 128.90623 144.36246 128.90623 3.510190 3.510190
## MASE ACF1 Theil's U
## seasonal_naive_model 1.8067449 0.7521645 4.295927
## drift_model 1.1174068 0.7012110 2.980163
## ses_model 1.3058386 0.7082495 3.360923
## holt_model 2.4299059 0.7480452 5.981476
## holt_damped_model 1.7463764 0.7155996 4.332740
## hw_additive_model 1.0878048 0.6785816 2.672182
## hw_multiplicative 0.5271803 0.7421122 1.301538
## ets_auto 1.7618666 0.7157938 4.368213
## arima 0.7671721 0.6688433 1.946462
autoplot(train_NYC, color='black')+
autolayer(test_NYC,size=1.2,color='black')+
autolayer(seasonal_naive_model,series = 'Seasonal Naive Model',PI=F)+
autolayer(drift_model,series = 'Drift Model',PI=F)+
autolayer(ses_model,series = 'SES Model',PI=F)+
autolayer(holt_model,series = 'Holt',PI=F)+
autolayer(holt_damped_model,series = 'Holt Damped',PI=F)+
autolayer(hw_additive,series = 'Holt Winter Additive',PI=F)+
autolayer(hw_multiplicative,series = 'Holt Winter Multiplicative',PI=F)+
autolayer(ets_auto_forecast,series = 'ETS Auto',PI=F)+
autolayer(arima_auto_forecast,series = 'Arima Auto',PI=F)+
labs(title = "Prediction Performance: Median Rent Over Time NYC",
x = "Date",
y = "Price")
Model Selection: To obtain predictions with the highest accuracy, we compare the performance of the simple forecasting, exponential smoothing, ETS, and ARIMA models on the testing data. As observed in Figure 4, the Holt-Winters multiplicative method closely mimics the testing data. This model is effective because the data exhibit both seasonal and trend variation, and it yields the lowest test RMSE at 96.63. Additionally, the diagnostic checks show the residuals behaving as white noise, with no apparent autocorrelation, affirming the model’s fit. The model’s forecast aligns well with the actual data when visualized, lending credibility to future rental price predictions.
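For completeness, the residual diagnostics referenced above can be reproduced for the selected model with checkresiduals(); a Ljung-Box p-value above 0.05 would be consistent with white-noise residuals.
# Residual diagnostics (Ljung-Box test plus residual ACF) for the chosen model
checkresiduals(hw_multiplicative)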
NYC_forecasted_2425 <- hw(train_NYC,h=36,seasonal = 'multiplicative', damped=T)
NYC_forecasted_2425$mean
## Jan Feb Mar Apr May Jun Jul Aug
## 2023 3410.151 3415.702 3434.519 3526.436 3576.448 3634.081 3665.192 3684.871
## 2024 3564.451 3559.971 3569.977 3656.353 3699.560 3751.000 3775.432 3788.509
## 2025 3636.349 3627.194 3633.096 3716.889 3756.926 3805.479 3826.799 3836.800
## Sep Oct Nov Dec
## 2023 3672.138 3636.529 3582.870 3570.109
## 2024 3768.737 3726.022 3665.371 3647.043
## 2025 3813.748 3767.721 3703.813 3682.891
autoplot(NYC_forecasted_2425)
Forecasting: Figure 4 shows an upward trend in the median rent prices. The forecast starts at USD 3,410 in January 2023 and rises to USD 3,682 by December 2025. The method incorporates seasonality and a damping factor that gradually reduces the trend component over time, providing a point forecast with a prediction interval, indicated by the shaded region. The multiplicative nature of the method means that seasonal changes are proportional to the level of the time series. This steady increase could be attributed to growth in the rental market due to inflation, housing demand, and other macroeconomic factors.
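To read the point forecasts and prediction intervals off the forecast object directly, a small sketch (the column names assume the default 80%/95% interval levels) collects them into a data frame:
forecast_2425 <- data.frame(
  Month = zoo::as.yearmon(time(NYC_forecasted_2425$mean)),
  Point = as.numeric(NYC_forecasted_2425$mean),
  Lo95 = as.numeric(NYC_forecasted_2425$lower[, "95%"]),
  Hi95 = as.numeric(NYC_forecasted_2425$upper[, "95%"])
)
tail(forecast_2425, 24) # the 2024-2025 portion of the 36-month horizon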
Noise Complaint Analysis: How do noise complaints vary across neighborhoods in the city? To analyze this, we implement cluster analysis to group neighborhoods on the basis of the various types of noise complaints. This will help identify the characteristics of a given neighborhood and determine whether it suits a resident’s requirements.
cleaned_noise_final =read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/NOISE/cleaned_noise_final.csv")
library(dplyr)
library(skimr)
library(tidyr)
library(lubridate)
H2_data<- cleaned_noise_final%>%
mutate(year = year(Created.Date))
H2_data = H2_data %>%
filter(year %in% c(2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022))
H2_sum = H2_data %>%
group_by(Community.Board, Borough) %>%
summarise(complaints = n()) %>%
arrange(desc(complaints))
## `summarise()` has grouped output by 'Community.Board'. You can override using
## the `.groups` argument.
sum(H2_sum$complaints)
## [1] 5279309
In 2012-2022, the top 5 community boards that received the most complaints were:
H2_sum$Community.Board[1:5]
## [1] "12 MANHATTAN" "12 BRONX" "01 BROOKLYN" "03 MANHATTAN" "10 MANHATTAN"
In 2012-2022, the 5 community boards that received the fewest complaints were:
H2_sum$Community.Board[67:71]
## [1] "28 BRONX" "56 BROOKLYN" "84 QUEENS" "80 QUEENS" "83 QUEENS"
For each borough, display the total complaints by community board using facet_grid:
library(ggplot2)
library(dplyr)
# Assuming H2_sum is already loaded
# Identify the highest and lowest points in each Borough
aaa <- H2_sum %>%
group_by(Borough) %>%
mutate(
PointType = case_when(
complaints == max(complaints) ~ "Highest",
complaints == min(complaints) ~ "Lowest",
TRUE ~ "Other"
)
) %>%
ungroup()
# Plot with points highlighted
ggplot(aaa, aes(x=Community.Board, y=complaints)) +
geom_point(aes(color = PointType, shape = PointType)) + # Color and shape based on the PointType
scale_shape_manual(values=c(20,5,20)) + # Assigning specific shapes to points
scale_color_manual(values=c("red", "blue", "black")) + # Assigning colors to the highest and lowest points
scale_y_continuous(labels = scales::comma) + # To change y-axis labels to plain numbers
facet_grid(.~Borough) +
labs(x = 'Community Board', y = 'Total Complaints', title = 'Total Complaints by Community Board and Borough') +
geom_text(aes(label = Community.Board), position = position_dodge(width = 0.9), hjust = -0.1, size = 1.5) +
theme(legend.position = "none") # Hide the legend
ggplot(aaa, aes(x=Community.Board, y=complaints)) +
geom_point(aes(color = PointType, shape = PointType), size = 3) + # Color and shape based on the PointType
scale_shape_manual(values=c(20,5,20)) + # Assigning specific shapes to points
scale_color_manual(values=c("red", "blue", "black")) + # Assigning colors to the highest and lowest points
scale_y_continuous(labels = scales::comma) + # To change y-axis labels to plain numbers
facet_grid(.~Borough, scales = "free_x") + # Use free scales for the x-axis
labs(x = 'Community Board', y = 'Total Complaints', title = 'Total Complaints by Community Board and Borough') +
geom_text(data = subset(aaa, PointType != "Other"), aes(label = Community.Board), hjust = 1.1, vjust = 0, size = 1) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none",
strip.text.x = element_text(size = 3) # Ensure facet labels are small enough to fit
) # Hide the legend and x-axis text/ticks
Initial observations: Upon visualizing the data, the number of complaints from each Community Board can be observed. 12 MANHATTAN (including the neighborhoods of Inwood and Washington Heights) and 12 BRONX (including neighborhoods of Edenwald, Wakefield, Williamsbridge, Woodlawn, Fish Bay, Eastchester, Olinville, and Baychester) appear to have a significantly high number of complaints while Staten Island has the least number of complaints.
H2_c = H2_data %>%
group_by(Complaint.Type, Community.Board, Borough) %>%
summarise(complaints = n())
## `summarise()` has grouped output by 'Complaint.Type', 'Community.Board'. You
## can override using the `.groups` argument.
sum(H2_c$complaints)
## [1] 5279309
Pivot Wider for future clustering
H2_c <- H2_c %>%
spread(key = Complaint.Type, value = complaints, fill = 0)
write.csv(H2_c, 'H2_datac.csv',row.names = F)
write.csv(H2_sum, 'H2_sum.csv',row.names = F)
data = read.csv(file = 'H2_datac.csv',stringsAsFactors = F)
data_cluster = data[,3:12]
data_cluster = scale(data_cluster)
head(data_cluster[,1:10])
## Collection.Truck.Noise Noise Noise...Commercial Noise...Helicopter
## [1,] -0.3792224 -0.63139849 -0.47927000 -0.30990948
## [2,] 0.5314244 1.81479270 3.79085294 -0.22562332
## [3,] 0.1671657 1.79003636 0.03230396 -0.01826716
## [4,] 1.6242004 0.77598766 2.70223534 -0.20943794
## [5,] 0.3796499 0.05348697 -0.27000800 -0.24364101
## [6,] -0.6827713 -0.80757713 -0.64086899 -0.31235255
## Noise...House.of.Worship Noise...Park Noise...Residential
## [1,] -0.26144715 -0.3905683 0.1459226
## [2,] 0.65630352 0.9670819 0.7177436
## [3,] -0.46608074 -0.3603983 -0.7857963
## [4,] -0.03510999 0.4775050 0.3491431
## [5,] -0.28935173 -0.5496465 0.2587831
## [6,] -0.27074868 -0.7402661 -0.4653364
## Noise...Street.Sidewalk Noise...Vehicle Noise.Survey
## [1,] 0.01921795 -0.1113019 -0.3979640
## [2,] 1.06722254 1.3230505 1.5244315
## [3,] -0.12698780 -0.1928585 -0.2848819
## [4,] 0.09173061 0.2757294 0.9665599
## [5,] -0.37055162 -0.1432280 0.3483778
## [6,] -0.25616150 -0.3613413 -0.4055028
d = dist(x = data_cluster,method = 'euclidean')
clusters = hclust(d = d,method='ward.D2')
plot(clusters)
cor(cophenetic(clusters),d)
## [1] 0.593858
A cophenetic correlation (CPC) above 0.7 indicates a relatively strong fit, while 0.3 < CPC < 0.7 indicates a moderate fit; at roughly 0.59, the hierarchical solution fits the data moderately well.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
grid.arrange(fviz_dend(x = clusters,k=2),
fviz_dend(x = clusters,k=3),
fviz_dend(x = clusters,k=4)
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
h_segments3 = cutree(tree = clusters,k=3)
table(h_segments3)
## h_segments3
## 1 2 3
## 46 13 12
library(cluster)
clusplot(data_cluster,
h_segments3,
color=T,shade=T,labels=4,lines=0,main='Hierarchical Cluster Plot')
within_ss = sapply(1:10,FUN = function(x){
set.seed(617)
kmeans(x = data_cluster,centers = x,iter.max = 1000,nstart = 25)$tot.withinss})
ggplot(data=data.frame(cluster = 1:10,within_ss),aes(x=cluster,y=within_ss))+
geom_line(col='steelblue',size=1.2)+
geom_point()+
scale_x_continuous(breaks=seq(1,10,1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ratio_ss = sapply(1:10,FUN = function(x) {
set.seed(617)
km = kmeans(x = data_cluster,centers = x,iter.max = 1000,nstart = 25)
km$betweenss/km$totss} )
ggplot(data=data.frame(cluster = 1:10,ratio_ss),aes(x=cluster,y=ratio_ss))+
geom_line(col='steelblue',size=1.2)+
geom_point()+
scale_x_continuous(breaks=seq(1,10,1))
library(cluster)
silhoette_width = sapply(2:10,
FUN = function(x) pam(x = data_cluster,k = x)$silinfo$avg.width)
ggplot(data=data.frame(cluster = 2:10,silhoette_width),aes(x=cluster,y=silhoette_width))+
geom_line(col='steelblue',size=1.2)+
geom_point()+
scale_x_continuous(breaks=seq(2,10,1))
set.seed(617)
km3 = kmeans(x = data_cluster,centers = 3,iter.max=10000,nstart=25)
k_segments3 = km3$cluster
table(k_segments3)
## k_segments3
## 1 2 3
## 18 49 4
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:randomForest':
##
## outlier
temp3 = data.frame(cluster = factor(k_segments3),
factor1 = fa(data_cluster,nfactors = 2,rotate = 'varimax')$scores[,1],
factor2 = fa(data_cluster,nfactors = 2,rotate = 'varimax')$scores[,2])
ggplot(temp3,aes(x=factor1,y=factor2,col=cluster))+
geom_point()
library(cluster)
clusplot(data_cluster,
k_segments3,
color=T,shade=T,labels=4,lines=0,main='k-means Cluster Plot')
library(mclust)
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
##
## sim
mclust_bic = sapply(1:10,FUN = function(x) -Mclust(data_cluster,G=x)$bic)
mclust_bic
## EEE,1
## 1746.8072 1098.7971 910.0312 866.0948 1537.5045 1672.9681 1303.1518 1337.1515
##
## 1464.8442 1408.5478
ggplot(data=data.frame(cluster = 1:10,bic = mclust_bic),aes(x=cluster,y=bic))+
geom_line(col='steelblue',size=1.2)+
geom_point()+
scale_x_continuous(breaks=seq(1,10,1))
library(mclust)
m_clusters3 = Mclust(data_cluster, G=3)
summary(m_clusters3)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -73.50577 71 179 -910.0312 -910.0314
##
## Clustering table:
## 1 2 3
## 34 26 11
m_segments3 = m_clusters3$classification
table(m_segments3)
## m_segments3
## 1 2 3
## 34 26 11
library(cluster)
clusplot(data_cluster,
m_segments3,
color=T,shade=T,labels=4,lines=0,main='mclust Cluster Plot')
table(h_segments3)
## h_segments3
## 1 2 3
## 46 13 12
table(k_segments3)
## k_segments3
## 1 2 3
## 18 49 4
table(m_segments3)
## m_segments3
## 1 2 3
## 34 26 11
Model Selection: A K-means model is selected for this analysis, as it is scalable and easy to implement. A model with 3 clusters is selected, supported by the silhouette-width analysis shown in Figure 7. Hierarchical clustering and model-based clustering were also tested with both 3 and 4 clusters; however, their results are less easily interpretable when examining the cluster profiles. One community board, 08 Queens (including the neighborhoods of Jamaica Estates, Holliswood, and Flushing South), stood out when setting the number of clusters to 3 or 4 in both hierarchical and K-means clustering. This area shows high complaint counts for residential, vehicle, and truck noise while having the fewest commercial complaints, indicating a noisy area with a less vibrant commercial life near JFK airport.
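As a quick check of the 08 Queens observation (a sketch using the unscaled counts), its complaint profile can be compared with the average across all 71 community boards:
data %>%
  filter(Community.Board == "08 QUEENS") %>%
  select(Collection.Truck.Noise:Noise.Survey)
round(colMeans(data[, 3:12])) # average counts across all community boards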
data2 = cbind(data, k_segments3)
library(dplyr)
data2 %>%
select(Collection.Truck.Noise:Noise.Survey,k_segments3)%>%
group_by(k_segments3)%>%
summarize_all(function(x) round(mean(x,na.rm=T),0))%>%
data.frame()
## k_segments3 Collection.Truck.Noise Noise Noise...Commercial
## 1 1 26 10829 13797
## 2 2 16 4656 3534
## 3 3 120 29820 11606
## Noise...Helicopter Noise...House.of.Worship Noise...Park Noise...Residential
## 1 461 327 1503 73297
## 2 431 109 531 26096
## 3 10925 131 556 31257
## Noise...Street.Sidewalk Noise...Vehicle Noise.Survey
## 1 32404 13256 267
## 2 7001 3188 68
## 3 12380 6142 168
data2 %>%
select(Collection.Truck.Noise:Noise.Survey, k_segments3) %>%
group_by(k_segments3) %>%
summarize_all(function(x) round(mean(x, na.rm = T), 0)) %>%
gather(key = var, value = value, Collection.Truck.Noise:Noise.Survey) %>%
mutate(var = ifelse(var == "Noise", "Noise.Random", var)) %>%
ggplot(aes(x = var, y = value, fill = factor(k_segments3))) +
geom_col(position = 'dodge') +
coord_flip() +
geom_text(aes(label = value), position = position_dodge(width = 0.9), hjust = -0.1, size = 1.5) +
theme(legend.position = "bottom",
legend.box = "horizontal",
legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 6),
legend.title = element_text(size = 7))
With 4 clusters (k_segments4), the community-board counts per cluster are 17, 49, 4, and 1; 08 Queens stood out as the single-board cluster.
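A sketch of how this 4-cluster split could be reproduced (assuming the same seed and scaled data as the 3-cluster run):
set.seed(617)
k_segments4 = kmeans(x = data_cluster,centers = 4,iter.max=10000,nstart=25)$cluster
table(k_segments4)
smallest = as.integer(names(which.min(table(k_segments4))))
data$Community.Board[k_segments4 == smallest] # the board(s) in the smallest cluster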
write.csv(data2, 'data2.csv',row.names = F)
# Add a new column with the sum of columns 3 through 12
cluster_profile <- mutate(data2, total_complaints = rowSums(data2[, 3:12], na.rm = TRUE))
cluster_profile <- cluster_profile %>%
select(1:2, total_complaints, everything())
table(cluster_profile$k_segments3,data2[,2])
##
## BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
## 1 5 5 6 2 0
## 2 10 15 3 17 4
## 3 0 0 4 0 0
Segment 1
cluster_profile %>%
filter(k_segments3 == 1) %>%
group_by(Community.Board, Borough) %>%
select(1:13) %>%
arrange(total_complaints)
## # A tibble: 18 × 13
## # Groups: Community.Board, Borough [18]
## Community.Board Borough total_complaints Collection.Truck.Noise Noise
## <chr> <chr> <dbl> <int> <int>
## 1 07 BROOKLYN BROOKLYN 52834 22 7093
## 2 02 MANHATTAN MANHATTAN 96231 52 22785
## 3 11 MANHATTAN MANHATTAN 100502 16 6915
## 4 02 BROOKLYN BROOKLYN 100663 31 21005
## 5 08 QUEENS QUEENS 101814 38 17885
## 6 09 BRONX BRONX 111318 11 3342
## 7 01 QUEENS QUEENS 117039 78 14096
## 8 04 BROOKLYN BROOKLYN 122780 5 7271
## 9 09 MANHATTAN MANHATTAN 132066 31 7756
## 10 05 BRONX BRONX 136208 3 1762
## 11 03 BROOKLYN BROOKLYN 136652 11 10529
## 12 07 BRONX BRONX 152444 11 4216
## 13 04 BRONX BRONX 163385 3 2464
## 14 10 MANHATTAN MANHATTAN 167450 24 9224
## 15 03 MANHATTAN MANHATTAN 171221 40 21412
## 16 01 BROOKLYN BROOKLYN 172334 42 22740
## 17 12 BRONX BRONX 279758 6 2643
## 18 12 MANHATTAN MANHATTAN 316333 45 11788
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## # Noise...House.of.Worship <int>, Noise...Park <int>,
## # Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## # Noise...Vehicle <int>, Noise.Survey <int>
Segment 2
cluster_profile %>%
filter(k_segments3 == 2) %>%
group_by(Community.Board, Borough) %>%
select(1:13)
## # A tibble: 49 × 13
## # Groups: Community.Board, Borough [49]
## Community.Board Borough total_complaints Collection.Truck.Noise Noise
## <chr> <chr> <dbl> <int> <int>
## 1 01 BRONX BRONX 68947 12 2385
## 2 01 MANHATTAN MANHATTAN 56867 30 22534
## 3 01 STATEN ISLAND STATEN ISLAND 73369 37 8084
## 4 02 BRONX BRONX 37180 2 919
## 5 02 QUEENS QUEENS 59091 30 10010
## 6 02 STATEN ISLAND STATEN ISLAND 27109 21 4981
## 7 03 BRONX BRONX 75296 5 1581
## 8 03 QUEENS QUEENS 77251 12 4751
## 9 03 STATEN ISLAND STATEN ISLAND 26025 32 5631
## 10 04 QUEENS QUEENS 57441 4 3552
## # ℹ 39 more rows
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## # Noise...House.of.Worship <int>, Noise...Park <int>,
## # Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## # Noise...Vehicle <int>, Noise.Survey <int>
Segment 3
cluster_profile %>%
filter(k_segments3 == 3) %>%
group_by(Community.Board, Borough) %>%
select(1:13)
## # A tibble: 4 × 13
## # Groups: Community.Board, Borough [4]
## Community.Board Borough total_complaints Collection.Truck.Noise Noise
## <chr> <chr> <dbl> <int> <int>
## 1 04 MANHATTAN MANHATTAN 109874 37 30683
## 2 06 MANHATTAN MANHATTAN 79380 66 26555
## 3 07 MANHATTAN MANHATTAN 136185 185 29754
## 4 08 MANHATTAN MANHATTAN 86978 190 32288
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## # Noise...House.of.Worship <int>, Noise...Park <int>,
## # Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## # Noise...Vehicle <int>, Noise.Survey <int>
prop.table(table(cluster_profile$k_segments3,data2[,2]),1)
##
## BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
## 1 0.27777778 0.27777778 0.33333333 0.11111111 0.00000000
## 2 0.20408163 0.30612245 0.06122449 0.34693878 0.08163265
## 3 0.00000000 0.00000000 1.00000000 0.00000000 0.00000000
library(ggplot2)
tab = prop.table(table(cluster_profile$k_segments3,data2[,2]),1)
tab2 = data.frame(round(tab,2))
library(RColorBrewer)
ggplot(data=tab2,aes(x=Var2,y=Var1,fill=Freq))+
geom_tile()+
geom_text(aes(label=Freq),size=6)+
xlab(label = '')+
ylab(label = '')+
scale_fill_gradientn(colors=brewer.pal(n=9,name = 'Reds'))
Cluster 1: High Levels of Noise Complaints. These neighborhoods have the highest levels of noise complaints, particularly residential, street/sidewalk, and vehicle noise, suggesting high population density and commercial zones with significant street activity. Eighteen areas fall under this cluster, including the East and West Village, Park Slope, and the center of Flushing. Individuals who enjoy vibrant, lively environments and prioritize convenience over quiet might choose to live here.
Cluster 2: Moderate to Low Levels of Noise Complaints. These neighborhoods may have a small mix of residential and commercial areas. All areas within Staten Island fall under this category, indicating that it is the quietest borough overall; the majority of the Bronx, the Financial District, and Jackson Heights also fall here. These neighborhoods are ideal for families and individuals who prioritize peace and tranquility while remaining within commuting distance of urban conveniences.
Cluster 3: High to Moderate Levels of Noise Complaints. These neighborhoods have relatively moderate levels of street and sidewalk complaints, indicating that they are not very densely populated. However, commercial, helicopter, and other noise complaints are significantly higher here than in the other two clusters, possibly due to proximity to helipads, offices, and similar facilities. As observed in Figure 9, only Manhattan neighborhoods lie in this cluster, including Chelsea, the Upper West Side, and the Upper East Side. These areas can suit residents and families who want to be within a short commute of restaurants, commercial spaces, and attractions.
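To make these profiles concrete, the average complaint counts per cluster can be tabulated directly from the cluster assignments. The following is a minimal sketch, assuming the cluster_profile data frame and k_segments3 assignments created in the clustering step above; the complaint-type columns are the ones shown in the tables.
library(dplyr)
# Minimal sketch: average complaint counts per cluster, using cluster_profile
# and the k_segments3 assignments from the clustering step above.
cluster_means <- cluster_profile %>%
  ungroup() %>%
  group_by(k_segments3) %>%
  summarize(
    areas = n(),
    across(c(Noise...Residential, Noise...Street.Sidewalk, Noise...Vehicle,
             Noise...Commercial, Noise...Helicopter),
           ~ mean(.x), .names = "avg_{.col}")
  )
cluster_means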
Crime Level Analysis: How do crime quantity and type differ between neighborhoods? To understand how crime quantity differs between neighborhoods, we will first use spatial analysis to map the distribution of crime across New York City. We will then use the same approach to identify the most common complaint in each precinct.
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2)
library(RColorBrewer)
crimedata = read.csv("crime_cleaned.csv")
register_google(key = "YOUR_GOOGLE_MAPS_API_KEY") # API key redacted
newyork_map = get_map(location = c(-73.93835,40.66392),zoom = 10,scale = 4)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=40.66392,-73.93835&zoom=10&size=640x640&scale=4&maptype=terrain&language=en-EN&key=xxx>
ggmap(newyork_map) +
geom_point(data = crimedata, aes(x = Longitude, y = Latitude, color = LAW_CAT_CD), size = 0.5) +
scale_color_manual(name = "Crime Category", values = c("FELONY" = "red", "MISDEMEANOR" = "orange", "VIOLATION" = "yellow"), labels = c("Felony", "Misdemeanor", "Violation")) +
theme(legend.text = element_text(size = 12), # Adjust the size of legend text
legend.title = element_text(size = 14), # Adjust the size of legend title
legend.key.size = unit(2, "lines"), # Adjust the size of legend symbols
legend.key.height = unit(2, "lines"), # Increase the height of legend symbols
legend.key.width = unit(2, "lines")) + # Increase the width of legend symbols
guides(color = guide_legend(override.aes = list(size = 5))) + # Increase the size of color points in the legend
labs(title = "Crime Complaints Distribution in New York City 2012 - 2022", hjust = 0.5, size = 20) # Add a centered and larger title to the graph
This map shows the distribution of crime by type (felony, misdemeanor, violation) across the city’s geography. Crime is more concentrated in Manhattan, the southern Bronx, and central Brooklyn, and felonies were the most common category of crime. Displaying the frequency of crimes as color-coded points allows for immediate visual identification of areas with higher crime densities. For spatial analysis, this map helps identify hotspots for specific types of crime and reveals geographic patterns or clusters; for example, a high concentration of red points in a particular area indicates a region with a high incidence of felonies.
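To quantify these hotspots rather than eyeball them, a kernel density layer can be overlaid on the same base map. The following is a minimal sketch that was not part of the original analysis; it reuses the crimedata and newyork_map objects defined above.
library(ggmap)
library(ggplot2)
library(RColorBrewer)
# Minimal sketch: overlay a 2D kernel density layer on the base map so the
# crime hotspots described above become explicit contours.
ggmap(newyork_map) +
  stat_density2d(data = crimedata,
                 aes(x = Longitude, y = Latitude, fill = after_stat(level)),
                 geom = "polygon", alpha = 0.3) +
  scale_fill_gradientn(name = "Complaint density", colors = brewer.pal(9, "Reds")) +
  labs(title = "Crime Complaint Density in New York City 2012 - 2022") +
  theme(plot.title = element_text(hjust = 0.5))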
library(ggmap)
library(ggplot2)
library(dplyr)
data = read.csv("crime_summary.csv")
register_google(key = "YOUR_GOOGLE_MAPS_API_KEY") # API key redacted
newyork_map = get_map(location = c(-73.93835,40.66392),zoom = 10,scale = 4)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=40.66392,-73.93835&zoom=10&size=640x640&scale=4&maptype=terrain&language=en-EN&key=xxx>
# Classify each offense description into a High/Medium/Low impact bucket,
# with candidate descriptions grouped by law category (felony, misdemeanor, violation).
classify_impact <- function(ofns_desc, law_cat_cd) {
if (law_cat_cd == "FELONY") {
high_impact <- c("HOMICIDE-NEGLIGENT,UNCLASIFIE", "MURDER & NON-NEGL. MANSLAUGHTER",
"HOMICIDE-NEGLIGENT,UNCLASSIFIE", "RAPE", "ROBBERY", "ARSON", "KIDNAPPING & RELATED OFFENSES", "DANGEROUS WEAPONS",
"SEX CRIMES", "THEFT-FRAUD")
medium_impact <- c("FELONY ASSAULT", "GRAND LARCENY", "GRAND LARCENY OF MOTOR VEHICLE",
"FORGERY", "POSSESSION OF STOLEN PROPERTY", "DANGEROUS DRUGS",
"MISCELLANEOUS PENAL LAW")
low_impact <- c("BURGLARY", "CRIMINAL MISCHIEF & RELATED OF", "NYS LAWS-UNCLASSIFIED FELONY",
"OBSCENIT", "PROSTITUTION & RELATED OFFENSES", "OTHER STATE LAWS (NON PENAL LA")
} else if (law_cat_cd == "MISDEMEANOR") {
high_impact <- c("ASSAULT 3 & RELATED OFFENSES", "CRIMINAL MISCHIEF & RELATED OF",
"DANGEROUS WEAPONS", "SEX CRIMES", "OFFENSES RELATED TO CHILDREN")
medium_impact <- c("OFFENSES AGAINST PUBLIC ADMINI", "OFFENSES AGAINST PUBLIC SAFETY",
"OFFENSES AGAINST THE PERSON", "OFFENSES INVOLVING FRAUD",
"OTHER OFFENSES RELATED TO THEF", "OFF. AGNST PUB ORD SENSBLTY &", "DANGEROUS DRUGS")
low_impact <- c("ADMINISTRATIVE CODE", "AGRICULTURE & MRKTS LAW-UNCLASSIFIED",
"ANTICIPATORY OFFENSES", "CRIMINAL TRESPASS", "FRAUDS", "FRAUDULENT ACCOSTING",
"INTOXICATED & IMPAIRED DRIVING", "JOSTLING", "OTHER STATE LAWS (NON PENAL LA",
"PETIT LARCENY", "PETIT LARCENY OF MOTOR VEHICLE", "POSSESSION OF STOLEN PROPERTY",
"THEFT OF SERVICES", "UNAUTHORIZED USE OF A VEHICLE", "VEHICLE AND TRAFFIC LAWS")
} else if (law_cat_cd == "VIOLATION") {
high_impact <- c("HARASSMENT 2", "HARRASSMENT 2")
medium_impact <- c("ADMINISTRATIVE CODE", "OTHER STATE LAWS")
low_impact <- c("MISCELLANEOUS PENAL LAW")
} else {
return(NA) # Return NA for cases with unknown LAW_CAT_CD
}
if (ofns_desc %in% high_impact) {
return("High Impact")
} else if (ofns_desc %in% medium_impact) {
return("Medium Impact")
} else if (ofns_desc %in% low_impact) {
return("Low Impact")
} else {
return(NA) # Return NA for cases not classified
}
}
data$impact <- mapply(classify_impact, data$OFNS_DESC, data$LAW_CAT_CD)
top_crime_precinct <- data %>%
group_by(ADDR_PCT_CD, LAW_CAT_CD, impact) %>%
summarize(total_complaints = sum(complaints)) %>%
group_by(ADDR_PCT_CD) %>%
top_n(1, total_complaints) %>%
ungroup()
## `summarise()` has grouped output by 'ADDR_PCT_CD', 'LAW_CAT_CD'. You can
## override using the `.groups` argument.
top_crime_precinct <- top_crime_precinct %>%
left_join(data %>% distinct(ADDR_PCT_CD, longitude, latitude, BORO_NM), by = "ADDR_PCT_CD")
ggmap(newyork_map) +
geom_point(data = top_crime_precinct, aes(x = longitude, y = latitude, color = LAW_CAT_CD, shape = impact, size = (total_complaints/10))) +
scale_color_manual(name = "Crime Category", values = c("FELONY" = "red", "MISDEMEANOR" = "orange", "VIOLATION" = "yellow"), labels = c("Felony", "Misdemeanor", "Violation")) +
scale_shape_manual(name = "Impact Level", values = c("High Impact" = 15, "Medium Impact" = 17, "Low Impact" = 19)) +
scale_size_continuous(name = "Total Complaints", guide = "none") +
theme(legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.key.size = unit(2, "lines")) +
guides(color = guide_legend(override.aes = list(size = 5)),
shape = guide_legend(override.aes = list(size = 5))) +
labs(title = "Most Complaints in each Precint by type of Crime and Severity in New York City 2012 - 2022", hjust = 0.5, size = 20)
This map takes a different approach by focusing on the most common complaint per precinct, categorized by the severity of the crime. It shows the distribution of the dominant complaint category across NYC precincts: most complaints in Midtown and Lower Manhattan were medium-impact felonies, while low-impact misdemeanors were the most common category in Staten Island. Because each precinct is represented by a single symbol characterizing its most frequent type of complaint, the map makes it easy to identify which precincts experience higher levels of certain types of crime and how the impact level of crime varies across precincts.
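As a cross-check on these observations, the dominant category-and-impact combination per borough can be tallied from the precinct-level summary. This is a rough sketch using the top_crime_precinct table built above; it simply counts precinct-level rows and keeps the most frequent combination per borough.
library(dplyr)
# Rough sketch: for each borough, count how many precincts have each
# category/impact combination as their most frequent complaint, then keep
# the combination that covers the most precincts.
top_crime_precinct %>%
  count(BORO_NM, LAW_CAT_CD, impact, name = "precincts") %>%
  group_by(BORO_NM) %>%
  slice_max(precincts, n = 1, with_ties = FALSE) %>%
  ungroup()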
top_five_crime_impact <- top_crime_precinct %>%
arrange(desc(total_complaints)) %>%
slice_head(n = 5) %>%
select(BORO_NM, ADDR_PCT_CD, LAW_CAT_CD, impact, total_complaints) %>%
rename(
Borough = BORO_NM,
Precinct = ADDR_PCT_CD,
'Category of Crime' = LAW_CAT_CD,
Impact = impact,
'Number of Complaints' = total_complaints
)
top_five_crime_impact
## # A tibble: 5 × 5
## Borough Precinct `Category of Crime` Impact `Number of Complaints`
## <chr> <int> <chr> <chr> <int>
## 1 BRONX 52 MISDEMEANOR Low Impact 143
## 2 QUEENS 114 FELONY Medium Impact 102
## 3 MANHATTAN 19 FELONY Medium Impact 95
## 4 BROOKLYN 67 MISDEMEANOR Low Impact 90
## 5 BROOKLYN 75 MISDEMEANOR Low Impact 89
While this study is comprehensive, we encountered several limitations that warrant mention. Connecting our datasets was challenging because each source identifies neighborhoods differently: by neighborhood name, zip code, police precinct, or community board. Additionally, the lack of publicly available data on rental amenities limited the analysis we were able to conduct on rental prices in the city. Further research would include building a directory that links these neighborhood identifiers (a sketch of such a crosswalk follows below) and gathering unit-specific amenity data. Finally, now that the data shows what needs to improve in each neighborhood, we would want to conduct further research into the concrete actions that could address the crime and noise issues these communities face.
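One possible shape for the proposed directory is a simple crosswalk table keyed on zip code that links each area to its community board and police precinct. The rows below are illustrative placeholders rather than verified mappings, and the join is shown only as a sketch of how the noise, rent, and crime tables could be connected.
library(dplyr)
# Illustrative sketch of a neighborhood crosswalk; the identifier values below
# are placeholders, not verified mappings.
neighborhood_directory <- tibble::tribble(
  ~Incident.Zip, ~Community.Board, ~precinct,
  10001,         "04 MANHATTAN",   10,
  11215,         "06 BROOKLYN",    78
)
# Example join: attach community board and precinct to the noise data by zip code.
# imputed_noise %>% left_join(neighborhood_directory, by = "Incident.Zip")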
Written by Group 6 (Stat Squad: Abhay Shah, David Skorodinsky, Javier Miguelez, Sarah Doctor, Tian Lan)